An End-to-End Neural Network for Polyphonic
Piano Music Transcription 

Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon 


Abstract —We present a supervised neural network model for 
polyphonic piano music transcription. The architecture of the 
proposed model is analogous to speech recognition systems and 
comprises an acoustic model and a music language model. The 
acoustic model is a neural network used for estimating the 
probabilities of pitches in a frame of audio. The language model is 
a recurrent neural network that models the correlations between 
pitch combinations over time. The proposed model is general and 
can be used to transcribe polyphonic music without imposing 
any constraints on the polyphony. The acoustic and language 
model predictions are combined using a probabilistic graphical 
model. Inference over the output variables is performed using the 
beam search algorithm. We perform two sets of experiments. We 
investigate various neural network architectures for the acoustic 
models and also investigate the effect of combining acoustic and 
music language model predictions using the proposed architecture.
We compare the performance of the neural network based
acoustic models with two popular unsupervised acoustic models.
Results show that convolutional neural network acoustic models
yield the best performance across all evaluation metrics. We
also observe improved performance with the application of the
music language models. Finally, we present an efficient variant
of beam search that improves performance and reduces run-times
by an order of magnitude, making the model suitable for
real-time applications.

Index Terms —Automatic Music Transcription, Deep Learning, 
Recurrent Neural Networks, Music Language Models. 

EDICS Category: AUD-MSP, AUD-MIR, MLR-DEEP 

I. Introduction

AUTOMATIC Music Transcription (AMT) is a fundamental
problem in Music Information Retrieval (MIR). AMT
aims to generate a symbolic, score-like transcription, given a
polyphonic acoustic signal. Music transcription is considered
to be a difficult problem even by human experts and current
music transcription systems fail to match human performance
[1]. Polyphonic AMT is a difficult problem because concurrently
sounding notes from one or more instruments cause a
complex interaction and overlap of harmonics in the acoustic
signal. Variability in the input signal also depends on the
specific type of instrument being used. Additionally, AMT
systems with unconstrained polyphony have a combinatorially
very large output space, which further complicates the
modelling problem. Typically, variability in the input signal is
captured by models that aim to learn the timbral properties
of the instrument being transcribed [2], [3], while the issues
relating to a large output space are dealt with by constraining
the models to have a maximum polyphony [4], [5].

The majority of current AMT systems are based on the
principle of describing the input magnitude spectrogram as
a weighted combination of basis spectra corresponding to
pitches. The basis spectra can be estimated by various techniques
such as non-negative matrix factorisation (NMF) and
sparse decomposition. Unsupervised NMF approaches [6], [7]
aim to learn a dictionary of pitch spectra from the training
examples. However, purely unsupervised approaches can often
lead to bases that do not correspond to musical pitches,
therefore causing issues with interpreting the results at test
time. These issues with unsupervised spectrogram factorisation
methods are addressed by incorporating harmonic constraints
in the training algorithm [8], [9]. Spectrogram factorisation
based techniques were extended with the introduction of
probabilistic latent component analysis (PLCA) [10]. PLCA
aims to fit a latent variable probabilistic model to normalised
spectrograms. PLCA based models are easy to train with
the expectation-maximisation (EM) algorithm and have been
extended and applied extensively to AMT problems [11], [12].

As an alternative to spectrogram factorisation techniques,
there has been considerable interest in discriminative approaches
to AMT. Discriminative approaches aim to directly
classify features extracted from frames of audio to the output
pitches. This approach has the advantage that instead of
constructing instrument specific generative models, complex
classifiers can be trained using large amounts of training
data to capture the variability in the inputs. When using
discriminative approaches, the performance of the classifiers is
dependent on the features extracted from the signal. Recently,
neural networks have been applied to raw data or low level
representations to jointly learn the features and classifiers for
a task [?]. Over the years there have been many experiments
that evaluate discriminative approaches for AMT. Poliner and
Ellis [?] use support vector machines (SVMs) to classify
normalised magnitude spectra. Nam et al. [?] superimpose
an SVM on top of a deep belief network (DBN) in order to
learn the features for an AMT task. Similarly, a bi-directional
recurrent neural network (RNN) is applied to magnitude
spectrograms for polyphonic transcription in [?].

In large vocabulary speech recognition systems, the information
contained in the acoustic signal alone is often not
sufficient to resolve ambiguities between possible outputs. A
language model is used to provide a prior probability of the
current word given the previous words in a sentence. Statistical
language models are essential for large vocabulary speech
recognition [?]. Similarly to speech, musical sequences exhibit
temporal structure. In addition to an accurate acoustic
model, a model that captures the temporal structure of music, or
a music language model (MLM), can potentially help improve
the performance of AMT systems. Unlike in speech recognition,
language models are not common in most AMT systems due to the
challenging problem of modelling the combinatorially large
output space of polyphonic music. Typically, the outputs of
the acoustic models are processed by pitch specific, two-state
hidden Markov models (HMMs) that enforce smoothing and
duration constraints on the output pitches [?], [?]. However,
extending this to modelling the high-dimensional outputs of
a polyphonic AMT system has proved to be challenging,
although there are some studies that explore this idea. A
dynamic Bayesian network is used in [?] to estimate prior
probabilities of note combinations in an NMF based transcription
framework. Similarly in [?], a recurrent neural network
(RNN) based MLM is used to estimate prior probabilities of
note sequences, alongside a PLCA acoustic model. A sequence
transduction framework is proposed in [?], where the acoustic
and language models are combined in a single RNN.

The ideas presented in this paper are extensions of the
preliminary experiments in [20]. We propose an end-to-end
architecture for jointly training both the acoustic and the language
models for an AMT task. We evaluate the performance
of the proposed model on a dataset of polyphonic piano music.
We train neural network acoustic models to identify the pitches
in a frame of audio. The discriminative classifiers can in
theory be trained on complex mixtures of instrument sources,
without having to account for each instrument separately.
The neural network classifiers can be directly applied to
the time-frequency representation, eliminating the need for
a separate feature extraction stage. In addition to the deep
feed-forward neural network (DNN) and RNN architectures
in [20], we explore using convolutional neural nets (ConvNets)
as acoustic models. ConvNets were initially proposed
as classifiers for object recognition in computer vision, but
have found increasing application in speech recognition [21],
[22]. Although ConvNets have been applied to some problems
in MIR [23], [24], they remain unexplored for transcription
tasks. We also include comparisons with two state-of-the-art
spectrogram factorisation based acoustic models [?], [?]
that are popular in AMT literature. As mentioned before,
the high dimensional outputs of the acoustic model pose
a challenging problem for language modelling. We propose
using RNNs as an alternative to state space models like
factorial HMMs [?] and dynamic Bayesian networks [?]
for modelling the temporal structure of notes in music. RNN
based language models were first used alongside a PLCA
acoustic model in [?]. However, in that setup, the language
model is used to iteratively refine the predictions in a feedback
loop, resulting in a non-causal and theoretically unsatisfactory
model. In the hybrid framework, approximate inference over
the output variables is performed using beam search. However,
beam search can be computationally expensive when used
to decode long temporal sequences. We apply the efficient
hashed beam search algorithm proposed in [26] for inference.
The new inference algorithm reduces decoding time by an
order of magnitude and makes the proposed model suitable
for real-time applications. Our results show that convolutional
neural network acoustic models outperform the remaining
acoustic models over a number of evaluation metrics. We
also observe improved performance with the application of
the music language models.

The rest of the paper is organised as follows: Section II
describes the neural network models used in the experiments.
Section III discusses the proposed model and the inference
algorithm. Section IV details model evaluation and experimental
results. Discussion, future work and conclusions are presented
in Section V.


II. Background

In this section we describe the neural network models used
for the acoustic and language modelling. Although neural
networks are an old concept, they have recently been applied
to a wide range of machine learning problems with great
success [?]. One of the primary reasons for their recent
success has been the availability of large datasets and large-scale
computing infrastructure [?], which makes it feasible to
train networks with millions of parameters. The parameters of
any neural network architecture are typically estimated with
numerical optimisation techniques. Once a suitable cost function
has been defined, the derivatives of the cost with respect
to the model parameters are found using the backpropagation
algorithm [?] and parameters are updated using stochastic
gradient descent (SGD) [29]. SGD has the useful property
that the model parameters are iteratively updated using small
batches of data. This allows the training algorithm to scale
to very large datasets. The layered, hierarchical structure of
neural nets makes end-to-end training possible, which implies
that the network can be trained to predict outputs from low-level
inputs without extracting features. This is in contrast to
many other machine learning models whose performance is
dependent on the features extracted from the data. Their ability
to jointly learn feature transformations and classifiers makes
neural networks particularly well suited to problems in MIR
[30].

A. Acoustic Models 

1) Deep Neural Networks: DNNs are powerful machine
learning models that can be used for classification and regression
tasks. DNNs are characterised by having one or more
layers of non-linear transformations. Formally, one layer of a
DNN performs the following transformation:

\[ h_{l+1} = f(W_l h_l + b_l). \tag{1} \]

In Equation 1, $W_l$, $b_l$ are the weight matrix and bias for
layer $l$, $0 \le l \le L$, and $f$ is some non-linear function
that is applied element-wise. For the first layer, $h_0 = x$,
where $x$ is the input. In all our experiments, we fix $f$ to
be the sigmoid function ($f(x) = \frac{1}{1 + e^{-x}}$). The output of
the final layer $h_L$ is transformed according to the given
problem to yield a posterior probability distribution over the
output variables $P(y|x, \theta)$. The parameters $\theta = \{W_l, b_l\}_{l=0}^{L}$
are numerically estimated with the backpropagation algorithm
and SGD. Figure 1a shows a graphical representation of the
DNN architecture; the dashed arrows represent intermediate
hidden layers. For acoustic modelling, the input to the DNN
is a frame of features, for example a magnitude spectrogram
or the constant Q transform (CQT), and the DNN is trained to
predict the probability of pitches present in the frame $P(y_t|x_t)$
at some time $t$.

Fig. 1. Neural network architectures for acoustic modelling.
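The layer-wise transformation described above can be illustrated with a few lines of NumPy. This is only a minimal sketch: the layer sizes (a 252-bin input frame, 125 hidden units, 88 output pitches), the random weights and the use of a sigmoid on the output layer are assumptions made for illustration, not values taken from the experiments.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic function f(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def dnn_forward(x, weights, biases):
    """Apply h_{l+1} = f(W_l h_l + b_l) layer by layer, starting from h_0 = x."""
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
    return h  # with a sigmoid output layer: one probability per pitch

# Hypothetical sizes: 252 input bins, 125 hidden units, 88 piano pitches.
rng = np.random.default_rng(0)
sizes = [252, 125, 88]
weights = [rng.normal(0.0, 0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

frame = rng.random(sizes[0])          # stand-in for one frame of features x_t
pitch_probs = dnn_forward(frame, weights, biases)
```

In a trained model the weights would come from backpropagation and SGD; here they are random, so `pitch_probs` only demonstrates the shapes involved.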

2) Recurrent Neural Networks: DNNs are good classifiers
for stationary data, like images. However, they are not
designed to account for sequential data. RNNs are natural
extensions of DNNs, designed to handle sequential or temporal
data. This makes them more suited for AMT tasks, since consecutive
frames of audio exhibit both short-term and long-term
temporal patterns [?]. RNNs are characterised by recursive
connections between the hidden layer activations at some time
$t$ and the hidden layer activations at $t - 1$, as shown in Figure
1b. Formally, the hidden layer of an RNN at time $t$ performs
the following computation:

\[ h_{l+1}^{t} = f(W_l^{f} h_l^{t} + W_l^{r} h_{l+1}^{t-1} + b_l). \tag{2} \]

In Equation 2, $W_l^{f}$ is the weight matrix from the input to
the hidden units, $W_l^{r}$ is the weight matrix for the recurrent
connection and $b_l$ are the biases for layer $l$. From Equation
2, we can see that the recursive update of the hidden state
at time $t$ implies that $h^t$ is implicitly a function of all the
inputs till time $t$, $x_0^t$. Similar to DNNs, RNNs are made
up of one or more layers of hidden units. The outputs of
the final layer are transformed with a suitable function to
yield the desired distribution over the outputs. The RNN
parameters $\theta = \{W_l^{f}, W_l^{r}, b_l\}_{l=0}^{L}$ are calculated using the
backpropagation through time algorithm (BPTT) [32] and SGD.
For acoustic modelling, the RNN acts on a sequence of input
features to yield a probability distribution over the outputs
$P(y_t|x_0^t)$, where $x_0^t = \{x_0, x_1, \ldots, x_t\}$.
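The recurrence can be sketched in NumPy for a single recurrent layer. The toy dimensions, the random weights and the all-zero initial hidden state are assumptions for illustration only. The sketch makes the causal structure concrete: the hidden state at time t depends on every input up to t, but on none after it.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_layer_forward(xs, W_f, W_r, b):
    """Run one recurrent layer over a sequence of frames.

    xs has shape (T, n_in); the returned hidden states have shape (T, n_hid).
    """
    h_prev = np.zeros(W_r.shape[0])   # assumed all-zero initial state
    states = []
    for x_t in xs:
        # input-to-hidden term plus recurrent term from the previous time step
        h_prev = sigmoid(W_f @ x_t + W_r @ h_prev + b)
        states.append(h_prev)
    return np.stack(states)

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 8, 4              # toy sizes
W_f = rng.normal(0.0, 0.5, (n_hid, n_in))
W_r = rng.normal(0.0, 0.5, (n_hid, n_hid))
b = np.zeros(n_hid)
xs = rng.random((T, n_in))

hs = rnn_layer_forward(xs, W_f, W_r, b)
```

Perturbing `xs[0]` changes every later row of `hs`, whereas perturbing `xs[-1]` leaves all rows but the last unchanged.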

3) Convolutional Networks: ConvNets are neural nets with
a unique structure. Convolutional layers are specifically designed
to preserve the spatial structure of the inputs. In a
convolutional layer, a set of weights acts on a local region
of the input. These weights are then repeatedly applied to the
entire input to produce a feature map. Convolutional layers
are characterised by the sharing of weights across the entire
input. As shown in Figure 1c, ConvNets comprise
alternating convolutional and pooling layers, followed by one
or more fully connected layers (same as DNNs). Formally, the
repeated application of the shared weights to the input signal
constitutes a convolution operation:

\[ h_{j,k} = f\Big(\sum_{r} W_{r,j}\, x_{r+k-1} + b_j\Big). \tag{3} \]

The input $x$ is a vector of inputs from different channels,
for example RGB channels for images. Formally, $x =
\{x_0, x_1, \ldots\}$, where each input $x_i$ represents an input channel.
Each input band $x_i$ has an associated weight matrix. All the
weights of a convolutional layer are collectively represented
as a four dimensional tensor. Given an $m \times n$ region from a
feature map $h$, the max pooling function returns the maximum
activation in the region. At any time $t$, the input to the ConvNet
is a window of $2k + 1$ feature frames $x_{t-k}^{t+k}$. The outputs
of the final layer yield the posterior distribution $P(y_t|x_{t-k}^{t+k})$.

There are several motivations for using ConvNets for
acoustic modelling. There are many experiments in MIR that
suggest that rather than classifying a single frame of input,
better prediction accuracies can be achieved by incorporating
information over several frames of inputs [?], [?], [34].
Typically, this is achieved either by applying a context window
around the input frame or by aggregating information over
time by calculating statistical moments over a window of
frames. Applying a context window around a frame of low
level spectral features, like the short-time Fourier transform
(STFT), would lead to a very high dimensional input, which is
impractical. Secondly, taking the mean, standard deviation or other
statistical moments makes very simplistic assumptions about
the distribution of data over time in neighbouring frames.
ConvNets, due to their architecture [?], can be directly
applied to several frames of inputs to learn features along
both the time and the frequency axes. Additionally, when
using an input representation like the CQT, ConvNets can
learn pitch-invariant features, since inter-harmonic spacings in
music signals are constant across log-frequency. Finally, the
weight sharing and pooling architecture leads to a reduction
in the number of ConvNet parameters, compared to a fully
connected DNN. This is a useful property given that very large
quantities of labelled data are difficult to obtain for most MIR
problems, including AMT.
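The weight sharing and max pooling operations described above can be sketched in NumPy for a single input channel and a single feature map. The toy 4 x 4 input, the 2 x 2 weight patch and the helper names `conv2d_valid` and `max_pool` are illustrative assumptions; the non-linearity is omitted to keep the arithmetic easy to follow.

```python
import numpy as np

def conv2d_valid(x, w, b=0.0):
    """Slide one shared weight patch w over x (valid positions only) and add bias b."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the same weights w are reused at every position (i, j)
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return out

def max_pool(h, m, n):
    """Non-overlapping m x n max pooling; rows/columns that do not fill a region are dropped."""
    H, W = h.shape[0] // m * m, h.shape[1] // n * n
    return h[:H, :W].reshape(H // m, m, W // n, n).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)   # toy time-frequency patch
w = np.ones((2, 2))                            # one shared 2 x 2 weight patch
feature_map = conv2d_valid(x, w)               # shape (3, 3)
pooled = max_pool(feature_map, 2, 2)           # shape (1, 1)
```

The feature map is produced by only four weights and one bias regardless of the input size, which is the parameter saving relative to a fully connected layer mentioned above.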

B. Music Language Models 

Given a sequence $y = y_0^T$, we use the MLM to define a prior
probability distribution $P(y)$. $y_t$ is a high-dimensional binary
vector that represents the notes being played at $t$ (one time-step
of a piano-roll representation). The high dimensional nature
of the output space makes modelling $y_t$ a challenging problem.
Most post-processing algorithms make the simplifying
assumption that all the pitches are independent and model their
temporal evolution with independent models [?]. However,
for polyphonic music, the pitches that are active concurrently
are highly correlated (harmonies, chords). In this section, we
describe the RNN music language models first introduced in
[?].

1) Generative RNN: The RNNs defined in the earlier
sections were used to map a sequence of inputs $x$ to a sequence
of outputs $y$. At each time-step $t$, the RNN outputs the
conditional distribution $P(y_t|x_0^t)$. However, RNNs can be used
to define a distribution over some sequence $y$ by connecting
the outputs of the RNN at $t - 1$ to the inputs of the RNN at
$t$, resulting in a distribution of the form:

\[ P(y) = P(y_0) \prod_{t>0} P(y_t|y_0^{t-1}). \tag{4} \]

Although an RNN predicts $y_t$ conditioned on the high
dimensional inputs $y_0^{t-1}$, the individual pitch outputs $y_t(i)$
are independent, where $i$ is the pitch index (Section IV-C).
As mentioned earlier, this is not true for polyphonic music.
Boulanger-Lewandowski et al. [?] demonstrate that rather
than predicting independent distributions, the parameters of
a more complicated parametric output distribution can be
conditioned on the RNN hidden state. In our experiments, we
use the RNN to output the biases of a neural autoregressive
distribution estimator (NADE) [?].

2) Neural Autoregressive Distribution Estimator: The
NADE is a distribution estimator for high dimensional binary
data [?]. The NADE was initially proposed as a tractable
alternative to the restricted Boltzmann machine (RBM). The
NADE estimates the joint distribution over high dimensional
binary variables as follows:

\[ P(x) = \prod_{i} P(x_i|x_0^{i-1}). \]

The NADE is similar to a fully visible sigmoid belief network
[?], since the conditional probability of $x_i$ is a non-linear
function of $x_0^{i-1}$. The NADE computes the conditional distributions
according to:

\[ h_i = \sigma(W_{:,<i}\, x_0^{i-1} + b_h) \tag{5} \]

\[ P(x_i|x_0^{i-1}) = \sigma(V_i h_i + b_v^i) \tag{6} \]

where $W$, $V$ are weight matrices, $W_{:,<i}$ is a submatrix of
$W$ that denotes the first $i - 1$ columns and $b_h$, $b_v$ are the
hidden and visible biases, respectively. The gradients of the
likelihood function $P(x)$ with respect to the model parameters
$\theta = \{W, V, b_h, b_v\}$ can be found exactly, which is not possible
with RBMs. This property allows the NADE to be readily
combined with other models and the models can be jointly
trained with gradient based optimisers.
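The autoregressive factorisation can be made concrete with a short NumPy sketch that evaluates log P(x) one dimension at a time. The dimensions and random parameters are assumptions for illustration, and the code uses zero-based indexing, so the prefix for dimension i is `x[:i]`. Because every conditional is a proper Bernoulli distribution, the probabilities of all binary vectors sum to one, which the sketch checks by brute force on a tiny example.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nade_log_prob(x, W, V, b_h, b_v):
    """log P(x) = sum_i log P(x_i | x_0^{i-1}) for a binary vector x.

    W has shape (n_hidden, D) and V has shape (D, n_hidden).
    """
    a = b_h.copy()                 # running pre-activation W[:, :i] @ x[:i] + b_h
    log_p = 0.0
    for i in range(len(x)):
        h_i = sigmoid(a)                       # hidden activation for dimension i
        p_i = sigmoid(V[i] @ h_i + b_v[i])     # P(x_i = 1 | earlier dimensions)
        log_p += np.log(p_i) if x[i] else np.log(1.0 - p_i)
        a += W[:, i] * x[i]                    # extend the prefix by x_i
    return log_p

rng = np.random.default_rng(0)
D, n_hidden = 4, 3                 # toy sizes, small enough to enumerate 2**D vectors
W = rng.normal(size=(n_hidden, D))
V = rng.normal(size=(D, n_hidden))
b_h = rng.normal(size=n_hidden)
b_v = rng.normal(size=D)

total = sum(np.exp(nade_log_prob(np.array(bits), W, V, b_h, b_v))
            for bits in itertools.product([0, 1], repeat=D))
```

Updating the pre-activation `a` incrementally avoids recomputing the matrix product for every prefix, so one evaluation costs on the order of D times n_hidden operations.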

3) RNN-NADE: In order to learn high dimensional, temporal
distributions for the MLM, we combine the NADE and
an RNN, as proposed in [35]. The resulting model yields a
sequence of NADEs conditioned on an RNN, that describe a
distribution over sequences of polyphonic music. The joint
model is obtained by letting the parameters of the NADE
at each time step be a function of the RNN hidden state,
$\theta_{NADE}^{t} = f(h_t)$. $h_t$ is the hidden state of the final layer of the
RNN (Equation 2) at time $t$. In order to limit the number of
free parameters in the model, we only allow the NADE biases
to be functions of the RNN hidden state, while the remaining
parameters ($W$, $V$) are held constant over time. We compute
the NADE biases as a linear transformation of the RNN hidden
state plus an added bias term [?]:

\[ b_v^t = b_v + W_1 h_t \tag{7} \]

\[ b_h^t = b_h + W_2 h_t \tag{8} \]

$W_1$ and $W_2$ are weight matrices from the RNN hidden state
to the visible and hidden biases, respectively. The gradients
with respect to all the model parameters can be easily computed
using the chain rule and the joint model is trained using
the BPTT algorithm [35].


III. Proposed Model 


In this section we review the proposed neural network 
model for polyphonic AMT. As mentioned earlier, the model 
comprises an acoustic model and a music language model. 
In addition to the acoustic models in [20], we propose the
use of ConvNets for identifying pitches present in the input
audio signal and compare their performance to various other
acoustic models (Section IV). The acoustic and language
models are combined under a single training objective using a 
hybrid RNN architecture, yielding an end-to-end model for 
AMT with unconstrained polyphony. We first describe the 
hybrid RNN model, followed by a description of the proposed 
inference algorithm. 


A. Hybrid RNN 

The hybrid RNN is a graphical model that combines the
predictions of any arbitrary frame level acoustic model, with
an RNN-based language model. Let $x = x_0^T$ be a sequence
of inputs and let $y = y_0^T$ be the corresponding transcriptions.
The joint probability of $y, x$ can be factorised as follows:


Fig. 2. Graphical Model of the Hybrid Architecture


\[ P(y, x) = P(y_0 \ldots y_T, x_0 \ldots x_T) = P(y_0) P(x_0|y_0) \prod_{t=1}^{T} P(y_t|y_0^{t-1}) P(x_t|y_t). \tag{9} \]

The factorisation in Equation 9 makes the following independence
assumptions:

\[ P(y_t|y_0^{t-1}, x_0^{t-1}) = P(y_t|y_0^{t-1}) \tag{10} \]

\[ P(x_t|y_0^{t}, x_0^{t-1}) = P(x_t|y_t) \tag{11} \]

These independence assumptions are similar to the assumptions
made in HMMs [?]. Figure 2 is a graphical
representation of the hybrid model. In Equation 9, $P(x_t|y_t)$
is the emission probability of an input, given output $y_t$. Using
Bayes' rule, the conditional distribution can be written as
follows:

\[ P(y|x) \propto P(y_0|x_0) \prod_{t=1}^{T} P(y_t|y_0^{t-1}) P(y_t|x_t), \tag{12} \]

where the marginals $P(y_t)$ and priors $P(y_0)$, $P(x_0)$ are
assumed to be fixed w.r.t. the model parameters.
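The step from the joint factorisation to the conditional distribution can be spelled out as follows. This is a sketch of the standard argument, written under the assumptions stated above (fixed marginals and priors); it is not reproduced from the original derivation.

```latex
% Bayes' rule applied to each emission term and to the initial pair:
P(x_t|y_t) = \frac{P(y_t|x_t)\,P(x_t)}{P(y_t)}, \qquad
P(y_0)\,P(x_0|y_0) = P(y_0|x_0)\,P(x_0).

% Substituting into the joint factorisation and dividing by P(x):
P(y|x) = \frac{P(y,x)}{P(x)}
       = \frac{P(x_0)}{P(x)}\; P(y_0|x_0)
         \prod_{t=1}^{T} P(y_t|y_0^{t-1})\, P(y_t|x_t)\, \frac{P(x_t)}{P(y_t)}.

% P(x_0), P(x_t) and P(x) do not depend on y, and the marginals P(y_t)
% are treated as fixed, so all of these factors are absorbed into the
% proportionality constant:
P(y|x) \propto P(y_0|x_0) \prod_{t=1}^{T} P(y_t|y_0^{t-1})\, P(y_t|x_t).
```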

With this reformulation of the joint distribution, we observe
that the conditional distribution $P(y|x)$ is directly proportional
to the product of two distributions. The prior distribution
$P(y_t|y_0^{t-1})$ is obtained using a generative RNN (Section
II-B1) and the posterior distribution over note-combinations
$P(y_t|x_t)$ can be modelled using any frame based classifier.
The hybrid RNN graphical model is similar to an HMM, where
the state transition probabilities for the HMM $P(y_t|y_{t-1})$ have
been generalised to include connections from all previous
outputs, resulting in the $P(y_t|y_0^{t-1})$ terms in Equation 12.


For the problem of automatic music transcription, the input
time-frequency representation forms the input sequence $x$,
while the output piano-roll sequence $y$ denotes the transcriptions.
The priors $P(y_t|y_0^{t-1})$ are obtained from the RNN-NADE
MLM, while the posterior distributions $P(y_t|x_t)$ are
obtained from the acoustic models. The models can then be
trained by finding the derivatives of the acoustic and language
model objectives with respect to the model parameters and
training using gradient descent. The independent training of
the acoustic and language models is a useful property since
datasets available for music transcription are considerably
smaller in size as compared to datasets in computer vision and
speech. However, large corpora of MIDI music are relatively
easy to find on the internet. Therefore, in theory, the MLMs
can be trained on large corpora of MIDI music, analogous to
language model training in speech.


B. Inference 

At test time, we would like to find the mode of the
conditional output distribution:

\[ y^* = \operatorname*{argmax}_{y} P(y|x) \tag{13} \]

From Equation 12 we observe that the priors $P(y_t|y_0^{t-1})$
tie the predictions of the acoustic model $P(y_t|x_t)$ to all the
predictions made till time $t$. This prior term encourages coherence
between predictions over time and allows musicological
structure learnt by the language models to influence successive
predictions. However, this more general structure leads to
a more complex inference (or decoding) procedure at test
time. This is due to the fact that at time $t$, the history
$y_0^{t-1}$ has not been optimally determined. Therefore, the optimum
choice of $y_t$ depends on all the past model predictions.
Proceeding greedily in a chronological manner by selecting
$y_t$ that optimises $P(y_t|x_t)$ does not necessarily yield good
solutions. We are interested in solutions that globally optimise
$P(y|x)$. But exhaustively searching for the best sequence is
intractable since the number of possible configurations of $y_t$ is
exponential in the number of output pitches ($2^n$ for $n$ pitches).

Beam search is a graph search algorithm that is commonly
used to decode the conditional outputs of an RNN [39],
[?], [26]. Beam search scales to arbitrarily long sequences
and the computational cost versus accuracy trade-off can be
controlled via the width of the beam. The inference algorithm
comprises the following steps: at any time $t$,
the algorithm maintains at most $w$ partial solutions, where
$w$ is the beam width or the beam capacity. The solutions
in the beam at $t$ correspond to sub-sequences of length $t$.
Next, all possible descendants of the $w$ partial solutions in
the beam are enumerated and then sorted in decreasing order
of log-likelihood. From these candidate solutions, the top $w$
solutions are retained as beam entries for further search. Beam
search can be readily applied to problems where the number
of candidate solutions at each step is limited, like speech
recognition [40] and audio chord estimation [26]. However,
using beam search for decoding sequences with a large output
space is prohibitively inefficient.
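The steps of standard beam search listed above can be sketched in a few lines of Python for a small symbol alphabet. The function `step_log_probs` and the two-symbol `toy_model` are hypothetical stand-ins for a real sequence model; the toy probabilities are chosen so that greedy decoding (beam width 1) misses the most likely sequence while a beam of width 2 finds it.

```python
import heapq
import math

def beam_search(step_log_probs, n_steps, width):
    """Return the best (log_likelihood, sequence) found with a beam of the given width.

    step_log_probs(prefix) is a caller-supplied function returning a dict that maps
    each possible next symbol to its log-probability given the prefix.
    """
    beam = [(0.0, ())]                       # (log-likelihood, partial solution)
    for _ in range(n_steps):
        # enumerate all descendants of every partial solution in the beam
        candidates = [
            (ll + lp, prefix + (sym,))
            for ll, prefix in beam
            for sym, lp in step_log_probs(prefix).items()
        ]
        # retain the `width` candidates with the highest log-likelihood
        beam = heapq.nlargest(width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])

def toy_model(prefix):
    """Hypothetical two-symbol model in which the greedy first choice is a trap."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"a": math.log(0.5), "b": math.log(0.5)}
    return {"a": math.log(0.9), "b": math.log(0.1)}

greedy = beam_search(toy_model, n_steps=2, width=1)
wider = beam_search(toy_model, n_steps=2, width=2)
```

Enumerating every descendant is only feasible here because there are two symbols per step; with n pitches there are 2^n candidates per step, which is why the candidate set has to be restricted for transcription.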

When the space of candidate solutions is large, the algorithm
can be constrained to consider only $K$ new candidates for
each partial solution in the beam, where $K$ is known as the
branching factor. The procedure for selecting the $K$ candidates
can be designed according to the given problem. For the hybrid
architecture, from Equation 12, we note:

\[ P(y_0^t|x_0^t) \propto P(y_0^{t-1}|x_0^{t-1})\, P(y_t|y_0^{t-1})\, P(y_t|x_t) \tag{14} \]

At time $t$, the partial solutions in the beam correspond to
configurations of $y_0^{t-1}$. Therefore, given $P(y_0^{t-1}|x_0^{t-1})$, the $K$
configurations that maximise $P(y_t|y_0^{t-1}) P(y_t|x_t)$ would be
a suitable choice of candidates for $y_t$. However, for many
families of distributions, it might not be possible to enumerate
$y_t$ in decreasing order of likelihood. In [?], the authors
propose forming a pool of $K$ candidates by drawing random
samples from the conditional output distributions. However,
random sampling can be inefficient and obtaining independent
samples can be very expensive for many types of distributions.
As an alternative, we propose to sample solutions from the
posterior distribution of the acoustic model $P(y_t|x_t)$ [20].
There are two main motivations for doing this. Firstly, the outputs
of the acoustic model are independent class probabilities.
Therefore, it is easy to enumerate samples in decreasing order
of log-likelihood [?]. Secondly, we avoid the accumulation
of errors in the RNN predictions over time [?]. The RNN
models are trained to predict $y_t$, given the true outputs $y_0^{t-1}$.
However, at test time, outputs sampled from the RNN are fed
back as inputs at the next time step. This discrepancy between
the training and test objectives can cause prediction errors to
accumulate over time.
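To illustrate why independent class probabilities make ordered enumeration easy, the sketch below lists binary vectors in non-increasing order of likelihood using a best-first search over sets of bit flips. This is one possible approach, chosen for illustration; it is not claimed to be the procedure used in the cited work, and the function name `enumerate_by_likelihood` is hypothetical. It assumes every probability lies strictly between 0 and 1.

```python
import heapq
import itertools
import math

def enumerate_by_likelihood(probs):
    """Yield (log_likelihood, y) for binary vectors y in non-increasing likelihood order,
    given independent probabilities probs[i] = P(y_i = 1)."""
    n = len(probs)
    best = [1 if p > 0.5 else 0 for p in probs]          # most likely configuration
    best_ll = sum(math.log(max(p, 1.0 - p)) for p in probs)
    # cost (>= 0) of flipping bit i away from its most likely value, cheapest first
    costs = sorted((abs(math.log(p) - math.log(1.0 - p)), i)
                   for i, p in enumerate(probs))
    yield best_ll, tuple(best)
    if n == 0:
        return
    # heap entries: (total cost, position in `costs` of the last flip, flipped positions)
    heap = [(costs[0][0], 0, (0,))]
    while heap:
        cost, last, flipped = heapq.heappop(heap)
        y = list(best)
        for pos in flipped:
            y[costs[pos][1]] ^= 1
        yield best_ll - cost, tuple(y)
        if last + 1 < n:
            nxt = costs[last + 1][0]
            # extend the flip set with the next cheapest bit ...
            heapq.heappush(heap, (cost + nxt, last + 1, flipped + (last + 1,)))
            # ... or swap the last flipped bit for the next cheapest one
            heapq.heappush(heap, (cost - costs[last][0] + nxt, last + 1,
                                  flipped[:-1] + (last + 1,)))

probs = [0.9, 0.2, 0.6]                                  # toy per-pitch probabilities
top3 = list(itertools.islice(enumerate_by_likelihood(probs), 3))
```

Each flip set is generated exactly once and children never cost less than their parent, so the K most likely candidates are obtained after K pops without enumerating all 2^n configurations.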


Algorithm 1 High Dimensional Beam Search

Find the most likely sequence y given x with a beam width
w and branching factor K.

beam ← new beam object
beam.insert(0, {})
for t = 1 to T do
    new_beam ← new beam object
    for l, s, m_a, m_l in beam do
        for k = 1 to K do
            y' = m_a.next_most_probable()
            l' = log P_l(y'|s) P_a(y'|x_t) - log P(y')
            m_l' ← m_l with y_t := y'
            m_a' ← m_a with x := x_{t+1}
            new_beam.insert(l + l', {s, y'}, m_a', m_l')
    beam ← new_beam
return beam.pop()


In Algorithms 1 and 2, l is the log-likelihood of the partial
solution s, m_a and m_l are the acoustic and language model
objects, and f_h is the hash function.

There are two key differences between Algorithm 1 and the
algorithm in [20]. First, the priority queue that stores the beam
is replaced by a hash table beam object (see Algorithm 2).
Secondly, for each entry in the beam we evaluate $K$ candidate
solutions. This is in contrast to the algorithm in [20], where
once the beam is full, only $w$ candidate solutions are evaluated
per iteration. It might appear that the hashed beam search
algorithm is more expensive, since it evaluates $w \times K$
candidates instead of $w$ candidates. However, by efficiently
pruning similar solutions, the algorithm yields better results for
much smaller values of $w$, resulting in a significant increase
in efficiency (Section IV-F, Figure ?).


Algorithm 2 Description of beam objects given w, f_h, k

Initialise beam object
    beam.hashQ = defaultdict of priority queues*
    beam.queue = indexed priority queue of length w**

Insert l, s into beam
    key = f_h(s)
    queue = beam.queue
    hashQ = beam.hashQ[key]
    fits_in_queue = not queue.full() or l > queue.min()
    fits_in_hashQ = not hashQ.full() or l > hashQ.min()
    if fits_in_queue and fits_in_hashQ then
        hashQ.insert(l, s)
        if hashQ.overfull() then
            item = hashQ.del_min()
            queue.remove(item)
        queue.insert(l, s)
        if queue.overfull() then
            item = queue.del_min()
            beam.hashQ[f_h(item.s)].remove(item)

* A priority queue of length k maintains the top k entries
at all times.

** An indexed priority queue allows efficient random access
and deletion.


Although generating candidates from the acoustic model
yields good results, it requires the use of large beam widths.
This makes the inference procedure computationally slow and
unsuitable for real-time applications [?]. In this study, we
propose using the hashed beam search algorithm proposed in
[26]. Beam search is fundamentally limited when decoding
long temporal sequences. This is due to the fact that solutions
that differ at only a few time-steps can saturate the beam. This
causes the algorithm to search a very limited space of possible
solutions. This issue can be solved by efficient pruning. The
hashed beam search algorithm improves efficiency by pruning
solutions that are similar to solutions with a higher likelihood.
The metric that determines the similarity of sequences can be
chosen in a problem dependent manner and is encoded in the
form of a locality sensitive hash function [26]. In Algorithm
1, we outline the beam search algorithm used for
our experiments, while Algorithm 2 describes the hash table
beam object. In Algorithms 1 and 2, s is a sequence $y_0^t$ and
l is the log-likelihood of s.


Algorithm 2 describes the hash table beam object. The
hashed beam search algorithm offers several advantages compared
to the standard beam search algorithm. The notion of
similarity of solutions can be encoded in the form of hash
functions. For music transcription, we choose the similarity
function to be the last n frames in a sequence s. Setting n = 1
corresponds to a dynamic-programming-like decoding (similar
to HMMs) where all sequences with the same final state y_t are
considered to be equivalent, and the sequence with the highest
log-likelihood is retained. Setting n = len(sequence) corresponds
to regular beam search. Additionally, the hashed beam search
algorithm can maintain more than one solution per hash key through a
process called chaining [42].
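
The pruning idea described above can be illustrated with a minimal Python sketch. It is a batch simplification written for illustration, not the priority-queue implementation of Algorithm 2: the names `hash_key` and `prune`, and the representation of a candidate as a (log-likelihood, sequence) pair, are assumptions introduced here.

```python
import heapq


def hash_key(sequence, n):
    """Similarity key: the last n frames of a candidate sequence."""
    return tuple(sequence[-n:])


def prune(candidates, beam_width, per_key, n):
    """Keep the best `per_key` candidates per hash key, then the best `beam_width` overall.

    candidates: list of (log_likelihood, sequence) pairs, where sequence is a tuple of frames.
    Returns the survivors sorted by decreasing log-likelihood.
    """
    buckets = {}
    for log_likelihood, sequence in candidates:
        buckets.setdefault(hash_key(sequence, n), []).append((log_likelihood, sequence))

    survivors = []
    for bucket in buckets.values():
        # Chaining: each hash key may retain more than one solution.
        survivors.extend(heapq.nlargest(per_key, bucket, key=lambda c: c[0]))

    return heapq.nlargest(beam_width, survivors, key=lambda c: c[0])
```

With n = 1 and per_key = 1, candidates ending in the same frame compete and only the most likely one survives. With n equal to the sequence length, every distinct sequence has its own key and the function reduces to ordinary top-w pruning.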

IV. Evaluation 

In this section we describe how the performance of the 
proposed model is evaluated for a polyphonic transcription 
task. 
A. Dataset


We evaluate the proposed model on the MAPS dataset EH. 
The dataset consists of audio and corresponding annotations 
for isolated sounds, chords and complete pieces of piano 
music. For our experiments, we use only the full musical 
pieces for training and testing the neural network acoustic 
models and MLMs. The dataset consists of 270 pieces of
classical music and MIDI annotations. There are 9 categories
of recordings corresponding to different piano types
and recording conditions, with 30 recordings per category. Seven
categories of audio are produced by software piano synthesisers,
while 2 sets of recordings are obtained from a Yamaha
Disklavier upright piano. Therefore the dataset consists of 210
synthesised recordings and 60 real recordings.


We perform 2 sets of investigations in this paper. The first
set of experiments investigates the effect of the RNN MLMs
on the predictions of the acoustic models. For this task, we
divide the entire dataset into 4 disjoint train/test splits, so as to
ensure that the folds are music piece-independent. Specifically,
for some of the musical pieces in the dataset, audio for each
piece is rendered using more than one piano. Therefore, while
creating the splits, we ensure that the training and test data do
not contain any overlapping pieces.^ For each split, we select
80% of the data for training (216 musical pieces) and the
remaining 20% for testing (54 pieces). From each training split, we
hold out 26 tracks as a validation set for selecting the hyperparameters
for the training algorithm (Section IV-D). All the
reported results are mean values of the evaluation metrics over
the 4 splits. From now on, this evaluation configuration is
referred to as Configuration 1.
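
A piece-independent split of this kind can be sketched as follows. This is only an illustration of the constraint that no piece appears in both the training and test data: the function name `piece_independent_folds`, the (piece_id, piano_id) record format and the round-robin assignment of pieces to folds are assumptions, and the sketch does not reproduce the exact 216/54 partition used in the experiments.

```python
def piece_independent_folds(recordings, n_folds=4):
    """Build train/test folds so that no piece appears on both sides of a fold.

    recordings: iterable of (piece_id, piano_id) pairs (hypothetical record format).
    Returns a list of (train, test) pairs, one per fold.
    """
    recordings = list(recordings)
    pieces = sorted({piece for piece, _ in recordings})
    # Assign whole pieces, not individual recordings, to folds.
    fold_of = {piece: index % n_folds for index, piece in enumerate(pieces)}

    folds = []
    for fold in range(n_folds):
        test = [r for r in recordings if fold_of[r[0]] == fold]
        train = [r for r in recordings if fold_of[r[0]] != fold]
        folds.append((train, test))
    return folds
```

Because the assignment is made per piece, all renderings of the same piece on different pianos fall into the same fold.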


Although the above experimental setup is useful for investigating
the effectiveness of the RNN MLMs, the training set
contains examples from piano models which are also used for testing.
This is usually not true in practice, where the instrument
models/sources at test time are unknown and usually do not
coincide with the instruments used for training. A majority
of experiments with the MAPS dataset train and test models
on disjoint instrument types o, na, El. We thus perform 
a second set of experiments to compare the performance of the
different neural network acoustic models in a more realistic
setting. We train the acoustic models using the 210 tracks
created using synthesised pianos (180 tracks for training and
30 tracks for validation) and we test the acoustic models on
the 60 audio recordings obtained from the Yamaha Disklavier
piano (models ‘ENSTDkAm’ and ‘ENSTDkCl’ in
the MAPS database). In this experiment, we do not apply
the language models since the train and test sets contain
overlapping musical pieces. In addition to the neural network
acoustic models, we include comparisons with two state-of-the-art
unsupervised acoustic models [8], [13] for both
experiments. This instrument source-independent evaluation
configuration is referred to from now on as Configuration 2.


^Details available at: http://www.eecs.qmul.ac.uk/~sss31/TASLP/info.html


B. Metrics 

We use both frame and note based metrics to assess the 
performance of the proposed system El Frame-based eval¬ 
uations are made by comparing the transcribed binary output 
and the MIDI ground truth frame-by-frame. For note-based 
evaluation, the system returns a list of notes, along with the 
corresponding pitches, onset and offset time. We use the F- 
measure, precision, recall and accuracy for both frame and 
note based evaluation. Formally, the frame-based metrics are 
defined as: 


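\mathcal{P} = \frac{\sum_{t=1}^{T} TP[t]}{\sum_{t=1}^{T} \left( TP[t] + FP[t] \right)}

\mathcal{R} = \frac{\sum_{t=1}^{T} TP[t]}{\sum_{t=1}^{T} \left( TP[t] + FN[t] \right)}

\mathcal{A} = \frac{\sum_{t=1}^{T} TP[t]}{\sum_{t=1}^{T} \left( TP[t] + FP[t] + FN[t] \right)}

\mathcal{F} = \frac{2 \, \mathcal{P} \, \mathcal{R}}{\mathcal{P} + \mathcal{R}}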

where TP[t] is the number of true positives for the event at 
t, FP is the number of false positives and FN is the number of 
false negatives. The summation over T is carried out over the 
entire test data. Similarly, analogous note-based metrics can 
be defined El A note event is assumed to be correct if its 
predicted pitch onset is within a ±50 ms range of the ground 
truth onset. 
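
The frame-based metrics can be computed with a short Python sketch. It assumes that the true positive, false positive and false negative counts are accumulated over all frames before the ratios are taken; the function name `frame_metrics` and the representation of each frame as a set of active pitches are assumptions made for illustration.

```python
def frame_metrics(reference, estimate):
    """Frame-based precision, recall, accuracy and F-measure.

    reference, estimate: equally long lists of frames, each frame a set of active pitches.
    Counts are summed over all frames before any ratio is taken.
    """
    tp = fp = fn = 0
    for ref, est in zip(reference, estimate):
        tp += len(ref & est)   # pitches active in both
        fp += len(est - ref)   # pitches reported but not in the ground truth
        fn += len(ref - est)   # ground truth pitches that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, accuracy, f_measure
```

For example, with reference frames [{60, 64}, {60}] and estimated frames [{60}, {60, 67}], the counts are TP = 2, FP = 1 and FN = 1, giving a precision and recall of 2/3 and an accuracy of 1/2.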


C. Preprocessing 

We transform the input audio to a time-frequency representation
which is then input to the acoustic models. In
[20], we used the magnitude short-time Fourier transform
(STFT) as input to the acoustic models. However, here we
experiment with the constant Q transform (CQT) as the input
representation. There are two motivations for this. Firstly,
the CQT is fundamentally better suited as a time-frequency
representation for music signals, since the frequency axis is
linear in pitch [46]. Secondly,
the resulting representation is much lower dimensional
than the STFT. Having a lower dimensional representation
is useful when using neural network acoustic models as it
reduces the number of parameters in the model.

We downsample the audio to 16 kHz from 44.1 kHz. We
then compute CQTs over 7 octaves with 36 bins per octave
and a hop size of 512 samples, resulting in a 252-dimensional
input vector of real values, with a frame rate of 31.25 frames
per second. Additionally, we compute the mean and standard
deviation of each dimension over the training set and transform
the data by subtracting the mean and dividing by the standard
deviation. These pre-processed vectors are used as inputs to the
acoustic model. For the language model training, we sample
the MIDI ground truth transcriptions of the training data at
the same rate as the audio (32 ms). We obtain sequences of
88-dimensional binary vectors for training the RNN-NADE
language models. The 88 outputs correspond to notes A0-C8
on a piano.
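
The figures quoted above follow directly from the sample rate, hop size and CQT resolution, and the standardisation step is a per-dimension affine transform. The Python sketch below reproduces the arithmetic and shows one way the standardisation could be written; the helper names `fit_standardiser` and `standardise`, the use of the population standard deviation and the guard against constant dimensions are assumptions, not details given in the text.

```python
import statistics

SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 512
OCTAVES = 7
BINS_PER_OCTAVE = 36

frame_rate = SAMPLE_RATE_HZ / HOP_SAMPLES               # frames per second
frame_period_ms = 1000 * HOP_SAMPLES / SAMPLE_RATE_HZ   # time between frames
n_bins = OCTAVES * BINS_PER_OCTAVE                      # input dimensionality


def fit_standardiser(train_frames):
    """Per-dimension mean and standard deviation over the training frames."""
    columns = list(zip(*train_frames))
    means = [statistics.mean(column) for column in columns]
    # Assumption: constant dimensions are left unscaled to avoid division by zero.
    stds = [statistics.pstdev(column) or 1.0 for column in columns]
    return means, stds


def standardise(frame, means, stds):
    """Subtract the training mean and divide by the training standard deviation."""
    return [(x - m) / s for x, m, s in zip(frame, means, stds)]
```

The statistics are estimated on the training set only and then applied unchanged to validation and test frames.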

The test audio is sampled at a frame rate of 100 Hz, yielding
100 × 30 = 3000 frames per test file. For 54 test files over 4
splits, we obtain a total of 648,000 frames at test time.^

D. Network Training 

In this section we describe the details of the training 
procedure for the various acoustic model architectures and the 
RNN-NADE language model. All the acoustic models have 
88 units in the output layer, corresponding to the 88 output 
pitches. The outputs of the final layer are transformed by
a sigmoid function and yield independent pitch probabilities
P(y_t(i) = 1 | x). All the models are trained by maximising the
log-likelihood over all the examples in the training set.

1) DNN Acoustic Models: For DNN training, we constrain
all the hidden layers of the model to have the same number
of units to simplify searching for good model architectures.
We perform a grid search over the following parameters:
number of layers L ∈ {1, 2, 3, 4}, number of hidden units
H ∈ {25, 50, 100, 125, 150, 200, 250}, hidden unit activations
act ∈ {ReLU, sigmoid}, where ReLU is the rectified linear
unit activation function [48]. We found Dropout [49] to be
essential for improving generalisation performance. A Dropout 
rate of 0.3 was used for the input layer and all the hidden 
layers of the network. Rather than using learning rate and 
momentum update schedules, we use ADADELTA 1^ to 
adapt the learning over iterations. In addition to Dropout, 
we use early stopping to minimise overfitting. Training was 
stopped if the cost over the validation set did not decrease for 
20 epochs. We used mini batches of size 100 for the SGD 
updates. 
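
The size of the DNN search space and the early stopping rule can be made concrete with a small Python sketch. The dictionary keys and the function name `should_stop` are illustrative assumptions; only the parameter values and the patience of 20 epochs come from the text.

```python
from itertools import product

LAYERS = (1, 2, 3, 4)
HIDDEN_UNITS = (25, 50, 100, 125, 150, 200, 250)
ACTIVATIONS = ("relu", "sigmoid")

# Every combination of depth, width and activation is one candidate architecture.
dnn_grid = [
    {"layers": layers, "hidden_units": units, "activation": activation}
    for layers, units, activation in product(LAYERS, HIDDEN_UNITS, ACTIVATIONS)
]


def should_stop(validation_costs, patience=20):
    """Return True once the validation cost has not decreased for `patience` epochs.

    validation_costs: one validation cost per completed epoch, in order.
    """
    if not validation_costs:
        return False
    best_epoch = validation_costs.index(min(validation_costs))
    return len(validation_costs) - 1 - best_epoch >= patience
```

The grid contains 4 × 7 × 2 = 56 candidate architectures, each of which would be trained until `should_stop` fires.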

2) RNN Acoustic Models: For RNN training, we constrain 
all the hidden layers to have the same number of units. We 
perform a grid search over the following parameters: L ∈
{1, 2, 3}, H ∈ {25, 50, 100, 150, 200, 250}. We fix the hidden
activations of the recurrent layers to be the hyperbolic tangent 
function. We found that ADADELTA was not particularly well 
suited for training RNNs. We use an initial learning rate of 
0.001 and linearly decrease it to 0 over 1000 iterations. We 
use a constant momentum rate of 0.9. The training sequences 
are further divided into sub-sequences of length 100. The SGD 
updates are made one sub-sequence at a time, without any 
mini batching. Similar to the DNNs, we use early stopping 
and stop training if validation cost does not decrease after 20 
iterations. In order to prevent gradient explosion in the early 
stages of training, we use gradient clipping El. We clipped 
the gradients, when the norm of the gradient was greater than 
5. 
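
The learning rate schedule and gradient clipping described above can be sketched in a few lines of Python. The function names are assumptions, and rescaling the gradient so that its norm equals the threshold is the common form of norm clipping, assumed here because the text does not spell out the exact rule.

```python
import math


def linear_decay(initial_rate, iteration, total_iterations=1000):
    """Learning rate that falls linearly from initial_rate to 0 over total_iterations."""
    remaining = max(0.0, 1.0 - iteration / total_iterations)
    return initial_rate * remaining


def clip_by_norm(gradient, max_norm=5.0):
    """Rescale the gradient to norm max_norm if its Euclidean norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]
```

For the RNN acoustic model the schedule would be called with an initial rate of 0.001. A gradient of (6, 8) has norm 10 and is rescaled to (3, 4), which has norm 5.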

3) ConvNet Acoustic Models: The input to the ConvNet 
is a context window of frames and the target is the central 
frame in the window 1^ . The frames at the beginning and 
end of the audio are zero padded so that a context window 
can be applied to each frame. Although pooling can be 

^It should be noted that carrying out statistical significance tests on a track
level is an over-simplification in the context of multi-pitch detection, as argued
in [47].


performed along both axes, we only perform pooling over
the frequency axis. We performed a grid search over the
following parameters: window size w_s ∈ {3, 5, 7, 9}, number
of convolutional layers L_c ∈ {1, 2, 3, 4}, number of filters per
layer n_l ∈ {10, 25, 50, 75, 100}, number of fully connected
layers L_fc ∈ {1, 2, 3}, number of hidden units in fully
connected layers H ∈ {200, 500, 1000}. The convolution
activation functions were fixed to be the hyperbolic tangent
function, while all the fully connected layer activations were
set to the sigmoid function. The pooling size is fixed to be
P = (1, 3) for all convolutional layers. Dropout with rate
0.5 is applied to all convolutional layers. We tried a large
number of window shapes for the convolutional layers and
the following subset of window shapes yielded good results:
w ∈ {(3, 3), (3, 5), (5, 5), (3, 25), (5, 25), (3, 75), (5, 75)}. We
observed that classification performance deteriorated sharply
for longer filters along the frequency axis. Dropout with rate 0.5 was also
applied to all the fully connected layers. The model parameters
were trained with SGD and a batch size of 256. An initial
learning rate of 0.01 was linearly decreased to 0 over 1000
iterations. A constant momentum rate of 0.9 was used for all the
updates. We stopped training if the validation error did not
decrease after 20 iterations over the entire training set.
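
The context-window input with zero padding at the start and end of the audio can be sketched as follows. The function name `context_windows` and the list-of-lists frame representation are assumptions made for illustration.

```python
def context_windows(frames, window_size):
    """Return one zero-padded context window per frame, centred on that frame.

    frames: list of equally long feature vectors (lists of floats).
    window_size: odd number of frames per window.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so that a central frame exists")
    half = window_size // 2
    dim = len(frames[0]) if frames else 0
    zero = [0.0] * dim
    # Pad both ends so that the first and last frames also get a full window.
    padded = [zero] * half + list(frames) + [zero] * half
    return [padded[i:i + window_size] for i in range(len(frames))]
```

The training target for window i is the label of frame i, which sits at position window_size // 2 inside the window.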

4) RNN-NADE Language Models: The RNN-NADE
models were trained with SGD and with sequences of
length 100. We performed a grid search over the following
parameters: number of recurrent units H_RNN ∈
{50, 100, 150, 200, 250, 300} and number of hidden units for
the NADE H_NADE ∈ {50, 100, 150, 200, 250, 300}. The
model was trained with an initial learning rate of 0.001 which
was linearly reduced to 0 over 1000 iterations. A constant
momentum rate of 0.9 was applied throughout training.

We selected the model architectures by performing a grid 
search over the parameter values described earlier in the 
section. The various models were evaluated on one train/test 
split and the best performing architecture was then used for 
all other experiments. 

E. Comparative Approaches 

For comparative purposes, two state-of-the-art polyphonic
music transcription methods were used for experiments [8],
[13]. In both cases, the non-binary pitch activation output of
the aforementioned methods was extracted, for performing an
in-depth comparison with the proposed neural network models.
The multi-pitch detection method of [8] is based on non-negative
matrix factorization (NMF) and operates by decomposing
an input time-frequency representation as a series of
basis spectra (representing pitches) and component activations
(indicating pitch activity across time). This method models
each basis spectrum as a weighted sum of narrowband spectra
representing a few adjacent harmonic partials, enforcing harmonicity
and spectral smoothness. As input time-frequency
representation, an Equivalent Rectangular Bandwidth (ERB)
filterbank is used. Since the method relies on a dictionary of
(hand-crafted) narrowband harmonic spectra, system parameters
remain the same for the two evaluation configurations.

The multiple-instrument transcription method of [13] is based
on shift-invariant PLCA (a convolutive and probabilistic coun- 
                                   Post Processing
                  Thresholding       HMM                Hybrid Architecture
Acoustic Model    Frame    Note      Frame    Note      Frame    Note
Benetos [13]      64.20    65.22     64.84    66.05     65.10    66.48
Vincent [8]       58.95    68.5      60.37    68.87     59.78    69.00
DNN               67.54    60.02     68.32    62.26     67.92    63.18
RNN               68.38    63.84     68.09    64.50     69.25    65.24
ConvNet           73.57    65.35     73.75    66.20     74.45    67.05

TABLE I

F-measures for multiple pitch detection on the MAPS dataset, using evaluation Configuration 1.



                  Precision (P)      Recall (R)         Accuracy (A)
Acoustic Model    Frame    Note      Frame    Note      Frame    Note
Benetos [13]      59.54    73.51     69.51    60.67     48.47    49.03
Vincent [8]       52.71    79.93     69.04    60.69     43.04    52.92
DNN               65.66    62.62     70.34    63.75     51.76    45.33
RNN               67.89    64.64     70.66    65.85     54.38    48.18
ConvNet           72.45    67.75     76.56    66.36     58.87    50.07

TABLE II

Precision, Recall and Accuracy for multiple pitch detection on the MAPS dataset using the hybrid architecture
(w = 10, K = 4, k = 2, f_h(y_0^t) = y_t), using evaluation Configuration 1.


Acoustic Model       Benetos [13]   Vincent [8]   DNN     RNN     ConvNet
F-measure (Frame)    59.31          59.60         59.91   57.67   64.14
F-measure (Note)     54.29          59.12         49.43   49.20   54.89

TABLE III

F-measures for acoustic models trained on synthesised pianos and tested on real recordings (evaluation Configuration 2).


terpart of NMF). In this model, the input time-frequency 
representation is decomposed into a series of basis spectra
per pitch and instrument source which are shifted across
log-frequency, thus supporting tuning changes and frequency
modulations. Outputs include the pitch activation distribution
and the instrument source contribution per pitch. Contrary to
the parametric model of [8], the basis spectra are pre-extracted
from isolated musical instrument sounds. As in the proposed
method, the input time-frequency representation of [13] is the
CQT. For the investigations with MLMs (Configuration 1),
the PLCA models are trained on isolated sound examples
from all 9 piano models from the MAPS database (in order
for the experiments to be comparable with the proposed
method). For the second set of experiments, which investigate
the generalisation capabilities of the models (Configuration 2),
the PLCA acoustic model is trained on isolated sounds from
the synthesised pianos and tested on recordings created using
the Yamaha Disklavier piano.

F. Results 

In this section we present results from the experiments on 
the MAPS dataset. As mentioned before, all results are the 
mean values of various metrics computed over the 4 different 
train/test splits. The acoustic models yield a sequence of
probabilities for the individual pitches being active (posteriograms).
The post-processing methods are used to transform
the posteriograms to a binary piano-roll representation. The


Model       Architecture
DNN         L = 3, H = 125
RNN         L = 2, H = 200
ConvNet     w_s = 7, L_c = 2, L_fc = 2, w_1 = (5, 25), P_1 = (1, 3),
            w_2 = (3, 5), P_2 = (1, 3), n_1 = n_2 = 50, h_1 = 1000, h_2 = 200
RNN-NADE    H_RNN = ^00, H_NADE = 150

TABLE IV

Model configurations for the best performing architectures.


various performance metrics (both frame and note based) are 
then computed by comparing the outputs of the systems to the 
ground truth. 

We consider 3 kinds of post-processing methods. The simplest
post-processing method is to apply a threshold to the
output pitch probabilities obtained from the acoustic model. 
We select the threshold that maximises the F-measure over the 
entire training set and use this threshold for testing. Pitches 
with probabilities greater than the threshold are set to 1, while 
the remaining pitches are set to 0. The second post-processing 
method considered uses individual pitch HMMs for post¬ 
processing similar to ca. The HMM parameters (transition 
probabilities, pitch marginals) are obtained by counting the 
frequency of each event over the MIDI ground truth data. The 
binary pitch outputs are obtained using Viterbi decoding [38],
where the scaled likelihoods are used as emission probabilities. 
Finally, we combine the acoustic model predictions with the 
RNN-NADE MLMs and obtain binary transcriptions using 
Fig. 3. Effect of beam width (w) on F-measure using evaluation Configuration 1;
k = 2, K = 4, f_h = y_t.


beam search. 
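
The difference between thresholding and HMM post-processing can be seen in a minimal Python sketch of a two-state (off/on) Viterbi decoder for a single pitch. The function name `viterbi_smooth` and the default self-transition probability and prior are illustrative assumptions; in the experiments the HMM parameters are estimated by counting events in the MIDI ground truth. Emissions are the scaled likelihoods, that is, the posterior divided by the state prior.

```python
import math


def viterbi_smooth(posteriors, p_stay=0.9, prior_on=0.5):
    """Decode the most likely off/on state sequence for one pitch.

    posteriors: non-empty list of P(pitch active | frame) values from the acoustic model.
    p_stay: probability of remaining in the current state (assumed value).
    prior_on: marginal probability of the pitch being active, strictly between 0 and 1.
    """
    def log(value):
        return math.log(max(value, 1e-12))  # floor avoids log(0)

    prior = (1.0 - prior_on, prior_on)
    trans = ((log(p_stay), log(1.0 - p_stay)),   # transitions from "off"
             (log(1.0 - p_stay), log(p_stay)))   # transitions from "on"

    def emission(p):
        # Scaled likelihoods: posterior divided by the prior of each state.
        return (log((1.0 - p) / prior[0]), log(p / prior[1]))

    first = emission(posteriors[0])
    score = [log(prior[s]) + first[s] for s in (0, 1)]
    backpointers = []
    for p in posteriors[1:]:
        emit = emission(p)
        new_score, pointers = [], []
        for s in (0, 1):
            prev = max((0, 1), key=lambda r: score[r] + trans[r][s])
            pointers.append(prev)
            new_score.append(score[prev] + trans[prev][s] + emit[s])
        score = new_score
        backpointers.append(pointers)

    # Backtrack from the best final state.
    state = max((0, 1), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

For the posteriors 0.9, 0.4, 0.9 a threshold of 0.5 produces the fragmented output 1, 0, 1, whereas the decoder keeps the note active across all three frames because leaving and re-entering the "on" state is penalised by the transition probabilities.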

In Table I we present F-measures (both frame and note
based) for all the acoustic models and the 3 post-processing
methods using Configuration 1. From the table, we note that
all the neural network models outperform the PLCA and NMF
models in terms of frame-based F-measure by 3% to 9%.
The DNN and RNN acoustic model performance is similar,
while the ConvNet acoustic model clearly outperforms all the
other models. The ConvNets yield an absolute improvement
of ≈ 5% over the other neural network models, while outperforming
the spectrogram factorisation models by ≈ 10%
in frame-wise F-measure. For the note-based F-measure, the
RNN and ConvNet models perform better than the DNN
acoustic model. This is largely due to the fact that these models
include context information in their inputs, which implicitly
smooths the output predictions.

We compare the different post-processing methods for Configuration
1 by observing the rows of Table I. We note that
the MLM leads to improved performance on both frame-based
and note-based F-measure for all the acoustic models. The
performance increase is larger on the note-based F-measure.
The relative improvement in performance is maximum for the
DNN acoustic model, compared to the RNN and the ConvNet.
This could be due to the fact that the independence assumption
in Equation is violated by the RNN and ConvNet, which
include context information while making predictions. This
leads to some factors being counted twice and we observe a
smaller performance improvement in this case. From Rows
1 and 2 of Table I we observe that the RNN-NADE MLM
yields a performance increase for the PLCA and NMF acoustic
models, though the relative improvement is smaller than for
the neural network acoustic models. This might be due to the
fact that, unlike the neural network models, these models are
not trained to maximise the conditional probability of output
pitches given the acoustic inputs. Another contributing factor
is the fact that the PLCA and NMF posteriograms represent the
energy distribution over pitches rather than explicit pitch probabilities,
which results in many activations being greater than
1. This discrepancy in the scale of the acoustic and language
predictions leads to an unequal weighting of predictions when
used in the hybrid RNN framework. In Table I we observe that
the acoustic model in [8] outperforms all other acoustic models
on the note-based F-measure, while its frame-based F-measure
is significantly lower. This can be attributed to the use of an
ERB filterbank input representation, which offers improved
temporal resolution over the CQT for lower frequencies.

In Table II we present additional metrics (precision, recall
and accuracy) for all the acoustic models after decoding
with an RNN MLM, using Configuration 1. We observe
that the NMF and PLCA models have low frame-based
precision and high frame-based recall, and the converse for the
note-based metrics. For the neural network models, we observe
smaller differences between precision and recall, for both the
frame-based and the note-based values. Amongst all the neural
network models, we observe that the ConvNet outperforms
all the other models on all the metrics.

In Table III we present F-measures for experiments where
the acoustic models are trained on synthesised data and tested
on real data (Configuration 2). From the table we note that
the frame-based F-measure for the DNN and RNN models is
similar to the PLCA model and the model in [8]. We note
that the ConvNet outperforms all other models on the frame-based
F-measure by ≈ 5%. On the note-based evaluations,
we observe that both RNN and DNN are outperformed by all
the other models. The ConvNet performance is similar to the
PLCA model, while the acoustic model from [8] again has the
best performance on the note-based metrics.

We now discuss details of the inference algorithm. The high
dimensional hashed beam search algorithm has the following
parameters: the beam width w, the branching factor K, the number
of entries per hash table entry k and the similarity metric
f_h (Algorithm 2). We observed that a value of K ≥ 4
produced good results. Larger values of K do not yield a
significant performance increase and result in much longer
run times, therefore we set K = 4 for all experiments.
We observed that small values of k (number of solutions
per hash table entry), 1 ≤ k ≤ 4, produced good results.
Decoding accuracies deteriorate sharply for large values of k,
as observed in 1^ . Therefore, we set the number of entries 
per hash key k = 2 for all experiments. We let the similarity
metric be the last n emitted symbols, f_h(y_0^t) = y_{t-n}^t. We
experimented with varying the values of n and observed
that we were able to achieve good performance for small n,
1 ≤ n ≤ 5. We did not observe any performance improvement
for large n, therefore for all experiments we fix f_h(y_0^t) = y_t.
Figure 3 is a plot showing the effect of beam width w on
transcription performance. The results are average values of
decoding accuracies over 4 splits. We compare the performance
of the hashed beam search with the high dimensional beam
search in [20]. From Figure 3 we observe that the hashed beam
search algorithm is able to achieve performance improvement
with significantly smaller beam widths. For instance, the high
dimensional beam search algorithm takes 20 hours to decode
the entire test set with w = 100, while the hashed beam search
takes 22 minutes with w = 10 and achieves better decoding
(a) ConvNet Posteriogram


(b) ConvNet Transcription 



(c) Ground Truth 

Fig. 4. a) Pitch-activation (posteriogram) matrix for the first 30 seconds of track MAPS_MUS-chpn_op27_2_AkPnStgb produced by a ConvNet acoustic 
model, b) Binary piano-roll transcription obtained from posteriogram in a) after post processing with RNN MLM and beam search, c) Corresponding ground 
truth piano roll representation. 


accuracy. 

Figure 4 is a graphical representation of the outputs of a
ConvNet acoustic model. We observe that some of the longer
notes are fragmented and the offsets are estimated incorrectly.
One reason for this is that the ground truth offsets do not
necessarily correspond to the offset in the acoustic signal (due
to effects of the sustain pedal), implying noisy offsets in the
ground truth. We also observe that the model does not make
many harmonic errors in its predictions.

V. Conclusions and Future Work 

In this paper, we present a hybrid RNN model for polyphonic
AMT of piano music. The model comprises a neural
network acoustic model and an RNN based music language
model. We propose using a ConvNet for acoustic modelling,
which to the best of the authors’ knowledge has not been
attempted before for AMT. Our experiments on the MAPS
dataset demonstrate that the neural network acoustic models,
especially the ConvNet, outperform 2 popular acoustic models
from the AMT literature. We also observe that the RNN MLMs
consistently improve performance on all evaluation metrics.
The proposed inference algorithm with the hashed beam search
is able to yield good decoding accuracies with significantly
shorter run times, making the model suitable for real-time
applications.

We now discuss some of the limitations of the proposed
model. As discussed earlier, one of the main contributing factors
to the success of deep neural networks has been the availability
of very large datasets. However, datasets available for
AMT research are considerably smaller than datasets available
in speech, computer vision and natural language processing
(NLP). Therefore the applicability of deep neural networks for
acoustic modelling is limited to datasets with large amounts
of labelled data, which is not common in AMT (at least
in non-piano music). Although the neural network acoustic
models perform competitively, their performance could be
further improved in many ways. Noise or deformations can
be added to training examples to encourage the classifiers to
be invariant to commonly encountered input transformations.
Additionally, the CQT input representation can be replaced by
a representation with higher temporal resolution (like the ERB
or a variable-Q transform), to improve performance on
note-based metrics.

The abundance of musical score data and recent progress 
in NLP tasks with neural networks provide strong motivation 
for further investigations into MLMs for AMT. Although 
our results demonstrate some improvement in transcription 
performance with MLMs, there are several limitations and 
open questions that remain. The MLMs are trained on binary 
vectors sampled from the MIDI ground truth. Depending on 
the sampling rate, most note events are repeated many times 
in this representation. The MLMs are trained to predict the 
next frame of notes, given an input sequence of binary note 
combinations. In cases where the same notes are repeated 
many times, log-likelihood can be trivially maximised by 
repeating previous inputs. This causes the MLM to perform a 
smoothing operation, rather than imposing any kind of musical 
structure on the outputs. A potential solution would be to 
perform beat-aligned language modelling for the training and 
the test data, rather than sampling the MIDI at some arbitrary 
sampling rate. Additionally, RNNs can be extended to include 
duration models for each of their pitch outputs, similar to 
second order HMMs. However, this is a challenging problem 
and currently remains unexplored. It would also be interesting 
to encourage RNNs to learn longer temporal note patterns by 
interfacing RNN controllers with external memory units |[5^ 
and also to incorporate a notion of timing or metre in the input 
representation for the MLMs. 

The effect of tonality on the performance of the MLMs 
should be further investigated. The MLMs should ideally be 
invariant to transpositions of a musical piece to different 
pitches. The MIDI ground truth can be easily transposed to 
any tonality. MLMs can be trained on inputs with transposed 
tonalities or individual MLMs for each key can be trained. 
Additionally, the fully connected input layer of the RNN MLM 
can be substituted with a convolutive layer, with convolutions
along the pitch axis to encourage the network to be invariant 
to pitch transpositions. 
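
Training on transposed inputs, as suggested above, only requires shifting the binary piano-roll along the pitch axis. The Python sketch below is one possible way to do this; the function names and the choice to drop notes that fall outside the 88-key range are assumptions made for illustration.

```python
N_KEYS = 88  # notes A0-C8


def transpose_frame(frame, semitones):
    """Shift one binary 88-key frame by `semitones`, dropping notes that leave the range."""
    if len(frame) != N_KEYS:
        raise ValueError("expected an 88-dimensional frame")
    shifted = [0] * N_KEYS
    for key, active in enumerate(frame):
        target = key + semitones
        if active and 0 <= target < N_KEYS:
            shifted[target] = 1
    return shifted


def transpose_roll(roll, semitones):
    """Transpose every frame of a piano-roll (list of 88-dimensional binary frames)."""
    return [transpose_frame(frame, semitones) for frame in roll]
```

Applying `transpose_roll` with each shift in a small range (for example -6 to +5 semitones) would yield one transposed copy of the training data per key.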

Another limitation of the proposed hybrid model is that the 
conditional probability in Equation is derived by assuming 
that the predictions at time t are only a function of the input 
at t and independent of all other inputs and outputs. The 
violation of this assumption leads to certain factors being 
counted twice and therefore reduces the impact of the MLMs. 
The results clearly demonstrate that improvements with the 
MLM are maximum when the acoustic model is frame-based. 
The improvements are comparatively lower when combined 
with predictions from an RNN or ConvNet acoustic model. 


This is problematic since the ConvNet acoustic models yield 
the best performance. 